Mining Japanese Compound Words and Their Pronunciations from Web Pages and Tweets
Abstract
Mining compound words and their pronunciations is essential for Japanese input method editors (IMEs). We propose using a chunk-based dependency parser to mine new words, collocations, and predicate-argument phrases from large-scale Japanese Web pages and tweets. The pronunciations of the compound words are automatically rewritten by a statistical machine translation (SMT) model. Experiments that apply the mined lexicon to a state-of-the-art Japanese IME system show that the precision of Kana-Kanji conversion is significantly improved.
Similar resources
Using a Chunk-based Dependency Parser to Mine Compound Words from Tweets
New words appear every day in online communication applications such as Twitter. Twitter is the world’s most famous online social networking and microblogging service, enabling its users to send and read text-based messages of up to 140 characters, known as “tweets”. Because tweets are typed online (as fast as possible) within a limited number of characters, tweets are full...
MHSubLex: Using Metaheuristic Methods for Subjectivity Classification of Microblogs
In Web 2.0, people are free to share their experiences, views, and opinions. One of the problems that arises in Web 2.0 is the sentiment analysis of texts produced by users in outlets such as Twitter. One of the main tasks of sentiment analysis is subjectivity classification. Our aim is to classify the subjectivity of tweets. To this end, we create subjectivity lexicons in which the words into ...
A Technique for Improving Web Mining using Enhanced Genetic Algorithm
The World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines use conventional methods to retrieve information on the Web; however, their search results can still be refined, and their accuracy is not high enough. One of the methods for web mining is evolutionary algorithms which search according to the user interests...
Using the Web to Train a Mobile Device Oriented Japanese Input Method Editor
This paper describes the construction of a Japanese Input Method Editor (IME) system for mobile devices using large-scale Web pages. We present the training process of our IME model, with an n-pos model for local Kana-Kanji conversion and an n-gram model for the online cloud service. In particular, we propose an online algorithm for mining new compound words, together with the detailed post-filtering process t...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications, including search engines, question-answering systems, recommender systems, and machine translation. An information extraction system aims to identify the entities in the text and extr...